Computational Intelligence Methods for Clustering of Sense Tagged Nepali Documents
Abstract
This paper presents a document clustering method that hybridizes the self-organizing map (SOM), particle swarm optimization (PSO), and the K-means clustering algorithm. Document representation is an important step in clustering. The common way of representing a text is the bag-of-words approach. This approach is simple but has two drawbacks, synonymy and polysemy, which arise from the ambiguity of words and the lack of information about the relations between them. To avoid these drawbacks, the words in this paper are tagged with their senses from WordNet. Sense tagging provides the exact senses of words. Feature vectors are generated from the sense-tagged documents, and clustering is carried out using the proposed hybrid SOM+PSO+K-means algorithm. In the proposed algorithm, SOM is first applied to the feature vectors to produce prototypes, and the K-means algorithm is then applied to cluster the prototypes. PSO is used to find the initial centroids for K-means. Text documents in the Nepali language are used to test the hybrid SOM+PSO+K-means clustering algorithm.
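The three stages described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the grid size, learning rate, swarm parameters, the sum-of-squared-errors fitness, and the final step that assigns each document to its nearest centroid are all assumptions, and every function name is hypothetical. The input `X` stands in for feature vectors built from sense-tagged documents.

```python
import numpy as np

def sq_dists(A, B):
    """Squared Euclidean distances between rows of A and rows of B."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)

def train_som(X, grid=(3, 3), epochs=20, lr0=0.5, sigma0=1.0, seed=0):
    """Train a small SOM and return its prototype vectors (assumed settings)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    W = X[rng.choice(len(X), rows * cols, replace=True)].astype(float)
    n_steps, step = epochs * len(X), 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                   # decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-3      # shrinking neighbourhood
            bmu = np.argmin(((W - X[i]) ** 2).sum(axis=1))
            grid_d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
            W += lr * h[:, None] * (X[i] - W)
            step += 1
    return W

def sse(P, centroids):
    """Fitness: sum of squared distances to the nearest centroid (assumed)."""
    return sq_dists(P, centroids).min(axis=1).sum()

def pso_init_centroids(P, k, n_particles=10, iters=30,
                       w=0.7, c1=1.5, c2=1.5, seed=0):
    """Search for k starting centroids; each particle encodes k centroids."""
    rng = np.random.default_rng(seed)
    pos = P[rng.integers(0, len(P), size=(n_particles, k))].astype(float)
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_cost = np.array([sse(P, p) for p in pos])
    g = pbest_cost.argmin()
    gbest, gbest_cost = pbest[g].copy(), pbest_cost[g]
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        cost = np.array([sse(P, p) for p in pos])
        better = cost < pbest_cost
        pbest[better], pbest_cost[better] = pos[better], cost[better]
        g = pbest_cost.argmin()
        if pbest_cost[g] < gbest_cost:
            gbest, gbest_cost = pbest[g].copy(), pbest_cost[g]
    return gbest

def kmeans(P, centroids, iters=50):
    """Plain Lloyd iterations starting from the given centroids."""
    centroids = centroids.copy()
    for _ in range(iters):
        labels = sq_dists(P, centroids).argmin(axis=1)
        new = np.array([P[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

def cluster_documents(X, k, seed=0):
    """SOM prototypes -> PSO-chosen initial centroids -> K-means on prototypes."""
    prototypes = train_som(X, seed=seed)
    init = pso_init_centroids(prototypes, k, seed=seed)
    centroids = kmeans(prototypes, init)
    # Assumption: each document takes the label of its nearest final centroid.
    return sq_dists(X, centroids).argmin(axis=1), centroids
```

Calling `cluster_documents(X, k)` on an array of shape (documents, features) returns one cluster label per document together with the k centroids. Running K-means on the SOM prototypes rather than on the raw vectors keeps the final clustering step small, which appears to be the motivation for the two-level design.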
Related resources
A Comparative Analysis of Particle Swarm Optimization and K-means Algorithm For Text Clustering Using Nepali Wordnet
The volume of digitized text documents on the web has been increasing rapidly. Because there is a huge collection of data on the web, there is a need to group (cluster) the documents for speedy information retrieval. Clustering of documents is the collection of documents into groups such that the documents within each group are similar to each other and not to documents of other groups...
Semantic Oriented Clustering of Documents
Semantic web-based approaches and computational intelligence can be merged in order to get useful tools for several data mining issues. In this work, a web-based tagging process followed by a validation step is carried out to tag WordNet adjectives with positive, neutral, or negative moods. This tagged WordNet is used to define a semantic metric for text document clustering. Experimental results on ...
PAYMA: A Tagged Corpus of Persian Named Entities
The goal in the named entity recognition task is to classify proper nouns of a piece of text into classes such as person, location, and organization. Named entity recognition is an important preprocessing step in many natural language processing tasks such as question-answering and summarization. Although many research studies have been conducted in this area in English and the state-of-the-art...
Some Challenges of Automated Annotation in A Multilingual Scenario
A key ingredient of today's NLP scenario is annotation, and this paper discusses the challenges involved in one of the toughest annotation tasks, sense marking. A large amount of data needs to be sense marked accurately by human annotators in order to train the machine to understand the spoken languages. The sense marked corpus for various languages facilitates the task of Word Sense Disambig...
Clustering techniques and discrete particle swarm optimization algorithm for multi-document summarization
Multi-document summarization is a process of automatic creation of a compressed version of a given collection of documents that provides useful information to users. In this article we propose a generic multi-document summarization method based on sentence clustering. We introduce five clustering methods, which optimize various aspects of intra-cluster similarity, inter-cluster dissimilarity an...
Publication date: 2015